Library quality control
This document describes our approach to assessing the quality of sequenced libraries, including contamination, coverage, and the proportion of missing data.
Contamination
Kraken2: cross-species contamination
We used Kraken2 to classify reads (based on k-mers) against a standard database of known contaminants. Kraken2 identified several samples with a large percentage of “classified” reads, most of them assigned to Vibrionales bacteria. Based on these results, we excluded the following samples as potentially highly contaminated:
- YPM-IZ-110432
- YPM-IZ-110973
- YPM-IZ-110632
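As a rough illustration of how the classified fraction could be screened, the sketch below reads the text of a standard six-column Kraken2 report, takes the classified percentage as 100 minus the percentage on the unclassified (`U`) row, and flags samples above a threshold. The function names and the 50% threshold are assumptions made for illustration; the cutoff behind the exclusions above is not stated in this document.

```python
def classified_percent(report_text):
    """Return the percentage of reads Kraken2 classified, from report text.

    Assumes the standard tab-separated report: percent, clade reads,
    taxon reads, rank code, taxid, name. The unclassified row has rank "U".
    """
    rows = [line.split("\t") for line in report_text.splitlines() if line.strip()]
    if not rows:
        raise ValueError("empty Kraken2 report")
    for fields in rows:
        if len(fields) >= 6 and fields[3].strip() == "U":
            return 100.0 - float(fields[0])
    # No unclassified row present: every read was classified
    return 100.0


def flag_contaminated(reports, max_classified=50.0):
    """Return sorted sample names whose classified percentage exceeds the threshold.

    `reports` maps sample name -> report text. The 50% default is a
    hypothetical placeholder, not the cutoff used in this study.
    """
    return sorted(
        sample
        for sample, text in reports.items()
        if classified_percent(text) > max_classified
    )
```

In practice each report would be read from the file written by Kraken2's `--report` option, and flagged samples would still be inspected by hand before exclusion.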
Sample YPM-IZ-110694-dry was a resequenced sample from dried tissue. For main analyses we will use the other sample, YPM-IZ-110694, and not the dried specimen.
Cross-contamination and duplicated samples
In addition to YPM-IZ-110694-dry, we sequenced one biological replicate (YPM-IZ-110474-2, a repeated extraction and library preparation of YPM-IZ-110474) and we generated two technical replicates (YPM-IZ-110269-A and -B, subsampled reads from YPM-110269). We used these replicates as quality control for library preparation and sequencing, and as a reference for identifying duplicated samples.
We assessed similarity by calculating relatedness scores using the KING-robust metric in PLINK2. Using a cutoff of 0.35 to detect potential duplicate samples, we identified the following pairs:
- YPM-IZ-106941 and YPM-IZ-106944
- YPM-IZ-110878 and YPM-IZ-110879
- YPM-IZ-111013 and YPM-IZ-111015
- YPM-IZ-111018 and YPM-IZ-111019
Sample collection location did not differ between samples in these pairs. The latter sample in each pair was subsequently excluded.
We also identified two pairs of samples that had moderately high relatedness (0.07 and 0.13, respectively) despite having been collected improbably far apart. In both cases, DNA was extracted in the same batch, suggesting cross-contamination.
- YPM-IZ-110826 (Chile) with sample FM16644 (Canary Islands)
- YPM-IZ-110827 (Chile) with samples YPM-IZ-110878 and YPM-IZ-110879 (Gulf of California).
The first-listed sample in each case (YPM-IZ-110826 and YPM-IZ-110827) was subsequently excluded.
No other pairs of samples showed relatedness above the score expected for close relatives.
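The screening described in this section can be sketched as follows, assuming a whitespace-delimited pairwise table with `IID1`, `IID2`, and `KINSHIP` columns of the kind PLINK2 writes with `--make-king-table`. The function names are hypothetical. The degree boundaries are the conventional KING inference ranges and are included only for orientation; the analysis above states only the 0.35 duplicate cutoff.

```python
DUPLICATE_CUTOFF = 0.35  # duplicate cutoff stated in the text

# Conventional KING lower bounds for each relationship class (for orientation only)
KING_BOUNDS = [
    (0.354, "duplicate/MZ twin"),
    (0.177, "1st degree"),
    (0.0884, "2nd degree"),
    (0.0442, "3rd degree"),
]


def parse_king_table(text):
    """Yield (id1, id2, kinship) tuples from a KING-style pairwise table."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].lstrip("#").split()
    i1 = header.index("IID1")
    i2 = header.index("IID2")
    ik = header.index("KINSHIP")
    for line in lines[1:]:
        fields = line.split()
        yield fields[i1], fields[i2], float(fields[ik])


def relationship(kinship):
    """Map a kinship coefficient to a conventional KING relationship class."""
    for bound, label in KING_BOUNDS:
        if kinship > bound:
            return label
    return "unrelated"


def duplicate_pairs(rows, cutoff=DUPLICATE_CUTOFF):
    """Return the sample pairs whose kinship exceeds the duplicate cutoff."""
    return [(a, b) for a, b, k in rows if k > cutoff]
```

Under these conventional bounds, kinship values of 0.07 and 0.13 fall in the third-degree and second-degree ranges, which is why such scores are unexpected between samples collected in different ocean basins.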
Mapping statistics
We examined sample quality by mapping reads to our assembled genome. A low mapping percentage indicates potential contamination or low quality. For each sample we counted the total number of reads, the reads that mapped to the reference genome, and the reads that were properly paired. From these counts we detected a significantly lower mapping percentage for one sample:
- YPM-IZ-110574
Tests using in silico PCR to extract 18S from this sample further indicated contamination from arthropod DNA. This sample will likewise be excluded.
Comparing statistics on mapped reads vs. properly paired reads shows no significant outliers in terms of quality.
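A minimal sketch of the mapping summary, assuming the three read counts are already available per sample (for example from a flagstat-style summary). The outlier rule shown, a cutoff a fixed number of median absolute deviations below the median, is an assumption for illustration; the test actually used to identify the outlier above is not specified here.

```python
from statistics import median


def mapping_rates(total, mapped, properly_paired):
    """Return mapped and properly paired reads as percentages of total reads."""
    if total <= 0:
        raise ValueError("total read count must be positive")
    return {
        "mapped_pct": 100.0 * mapped / total,
        "paired_pct": 100.0 * properly_paired / total,
    }


def low_mapping_outliers(mapped_pct, n_mads=5.0):
    """Return sorted samples whose mapped percentage is far below the median.

    `mapped_pct` maps sample name -> mapped percentage. A sample is flagged
    when its value is more than `n_mads` median absolute deviations below
    the median (a hypothetical rule, not the one used in this study).
    """
    values = list(mapped_pct.values())
    med = median(values)
    mad = median([abs(v - med) for v in values])
    cutoff = med - n_mads * mad
    return sorted(s for s, v in mapped_pct.items() if v < cutoff)
```

Because the rule is relative to the cohort, it only makes sense when most samples map well; a fixed minimum percentage would be a simpler alternative.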
FastQC: GC content
FastQC flagged no sequences as having low sequence quality, so all were retained. However, FastQC flagged several samples as having GC content distinct from the expected distribution. Based on these results, we flagged seven samples that failed the GC test and excluded them from our strict analyses (but retained them in the analyses presented in the supplementary figures).
Establishing appropriate filters
Site quality
We evaluated site quality using Phred scores; most sites have high scores. A Phred-encoded score of 40 indicates a 1 in 10,000 chance of an erroneous call. Based on this, we set the following filter:
- minimum site quality score = 40
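The relationship between a Phred score Q and the error probability P is P = 10^(-Q/10). The short sketch below (function names are illustrative) converts in both directions and reproduces the 1 in 10,000 figure for a score of 40.

```python
import math


def phred_to_error(q):
    """Convert a Phred quality score to the probability of an erroneous call."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Convert an error probability (0 < p <= 1) to a Phred quality score."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)
```

For comparison, scores of 20 and 30 correspond to error probabilities of 1 in 100 and 1 in 1,000.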
Coverage
We estimated sequencing coverage by dividing the number of base pairs sequenced for each sample by the length of the reference genome, 3.3 Gb. Samples were sequenced to variable depths, with estimated coverage ranging from 5x to 60x.
Samples with greater than 20x target coverage (dashed line) were considered high coverage, and were selected for genome size estimation and other high-coverage analyses.
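The coverage estimate is a single division, sketched below with the 3.3 Gb genome length and the 20x threshold taken from the text. The function names are hypothetical, and the total base count per sample is assumed to be known already (for example, the number of reads multiplied by the read length).

```python
GENOME_LENGTH_BP = 3.3e9  # reference genome length given in the text (3.3 Gb)
HIGH_COVERAGE_X = 20.0    # high-coverage threshold given in the text


def estimated_coverage(total_bases, genome_length=GENOME_LENGTH_BP):
    """Estimate coverage as sequenced base pairs divided by genome length."""
    if genome_length <= 0:
        raise ValueError("genome length must be positive")
    return total_bases / genome_length


def is_high_coverage(total_bases, threshold=HIGH_COVERAGE_X):
    """Return True when estimated coverage is greater than the threshold."""
    return estimated_coverage(total_bases) > threshold
```

This estimate is an upper bound on usable depth, since it ignores unmapped reads, duplicates, and contaminant reads.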
We used ANGSD to estimate realized depth across samples and sites using a subset of genomic regions. We categorized samples into those with poor coverage distributions, those with low coverage (between 2 and 10x), those with moderate coverage (between 11 and 20x), and those with high coverage (>20x). Poor coverage samples, which were excluded from the strict analyses in the paper, were the following:
- KM5633
- KM5634
- KM5635
- YPM-IZ-110277
- YPM-IZ-110268
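The numeric coverage categories can be expressed as a small lookup, sketched below. It assumes each sample is summarized by a single depth value that is rounded to a whole number before binning, so that the stated ranges (2 to 10x, 11 to 20x, >20x) leave no gap. The "poor coverage distribution" category is a judgment about the shape of the distribution and is not captured here; depths below 2x are simply reported as below range.

```python
def coverage_category(depth):
    """Assign a coverage category from a per-sample depth value.

    The depth is rounded to the nearest whole number first (an assumption),
    then binned using the ranges given in the text.
    """
    d = round(depth)
    if d > 20:
        return "high"
    if d >= 11:
        return "moderate"
    if d >= 2:
        return "low"
    return "below range"
```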
From the previous distributions we calculated peak depth for each sample and compared it to the total number of reads. We observed that GC distribution, as assessed using FastQC, impacts the relationship between input reads and realized depth.
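One way to obtain a peak depth is to take the mode of a per-sample depth histogram. The sketch below assumes the histogram is a list in which position `d` holds the number of sites observed at depth `d`, and it skips depth 0 by default so that uncovered sites do not dominate. Both the input layout and the choice to skip zero are assumptions for illustration.

```python
def peak_depth(site_counts, skip_zero=True):
    """Return the depth with the largest number of sites (the histogram mode).

    `site_counts[d]` is the number of sites at depth d. With `skip_zero`,
    depth 0 is ignored. Ties resolve to the lowest depth.
    """
    start = 1 if skip_zero else 0
    if len(site_counts) <= start:
        raise ValueError("histogram has no usable depth bins")
    return max(range(start, len(site_counts)), key=lambda d: site_counts[d])
```

Dividing the peak depth by the number of input reads gives a per-read yield that can then be compared across samples with different GC profiles.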
We used realized depth to establish the cutoff for minimum and maximum depth across sites.
For BCF-based analyses (e.g. Fst calculations), we applied both the minimum and maximum depth filters, while for ANGSD analyses, which are robust to low coverage, we did not perform any filtering based on minimum depth. Filters were set as follows:
- minimum depth = 2x
- maximum depth = 99x
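As a small sketch of how these bounds might be applied, the function below checks a depth value against the 2x and 99x limits and allows the minimum to be switched off, mirroring the ANGSD analyses where no minimum-depth filter was used. The function name and the convention of passing `None` to disable a bound are hypothetical.

```python
MIN_DEPTH = 2   # minimum depth given in the text
MAX_DEPTH = 99  # maximum depth given in the text


def passes_depth(depth, min_depth=MIN_DEPTH, max_depth=MAX_DEPTH):
    """Return True when depth lies within the inclusive bounds.

    Pass `min_depth=None` to skip the lower bound, as for analyses that
    tolerate low coverage.
    """
    if min_depth is not None and depth < min_depth:
        return False
    if max_depth is not None and depth > max_depth:
        return False
    return True
```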
Missingness
We examined the proportion of missing sites across samples. This was calculated using vcftools on a random subsample of sites from across genome regions. From this we identified several samples with a high proportion of missing sites, in particular:
- KM5634
We used the distribution of missing samples across sites to establish an appropriate filter for missingness. Here we set the following:
- require data for at least 75% of samples at each site
This means we exclude sites for which more than 25% of samples are missing data.
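The site-level rule can be sketched as below, assuming genotypes are given as VCF-style strings in which `./.`, `.|.`, or `.` mark a missing call. The function names are hypothetical; the 25% limit follows the exclusion rule stated above.

```python
MISSING_CALLS = {"./.", ".|.", "."}  # VCF-style missing genotype strings
MAX_MISSING_FRACTION = 0.25          # sites above this fraction are excluded


def site_missingness(genotypes):
    """Return the fraction of samples with a missing genotype at one site."""
    if not genotypes:
        raise ValueError("no genotypes supplied for this site")
    missing = sum(1 for g in genotypes if g in MISSING_CALLS)
    return missing / len(genotypes)


def keep_site(genotypes, max_missing=MAX_MISSING_FRACTION):
    """Return True when no more than `max_missing` of samples lack data."""
    return site_missingness(genotypes) <= max_missing
```

The same counting applied across sites for a single sample gives the per-sample proportion of missing sites used to identify problem samples.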